This dataset is public and available for research purposes only. It can be downloaded from http://www3.dsi.uminho.pt/pcortez/wine/.
Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
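The loading step is not shown in this report. A minimal sketch of how the data frame might be read and inspected follows. It uses an inline copy of the first three rows (values taken from the `str` output below) so that it runs on its own; the file name in the last comment is a hypothetical placeholder, not the actual name used here.

```r
# Inline stand-in for the CSV file: the first three rows, as printed by str(df).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
3,8.1,0.28,0.4,6.9,0.05,30,97,0.995,3.26,0.44,10.1,6"

# read.csv() accepts a string through its `text` argument.
df_sample <- read.csv(text = csv_text)

# Inspect column names and types, as done for the full data set.
str(df_sample)

# With the downloaded file, the call would look something like:
# df <- read.csv("wineQualityWhites.csv")  # hypothetical file name
```

With only three rows, some columns that are numeric in the full data (for example total.sulfur.dioxide) are read as integers here.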
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The dataset has 13 variables: an index column (X), 11 physicochemical measurements, and quality, the experts' score on a scale from 0 to 10. To use quality as a class label, we need a factor version of it.
df$quality.factor <- as.factor(df$quality)
str(df)
## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ quality.factor : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality quality.factor
## Min. : 8.00 Min. :3.000 3: 20
## 1st Qu.: 9.50 1st Qu.:5.000 4: 163
## Median :10.40 Median :6.000 5:1457
## Mean :10.51 Mean :5.878 6:2198
## 3rd Qu.:11.40 3rd Qu.:6.000 7: 880
## Max. :14.20 Max. :9.000 8: 175
## 9: 5
To clarify these charts:
- Blue line: the median.
- Red line: the mean.
- Dashed black lines: the quantiles.
- Dashed orange lines: the 1st and 99th percentiles (0.01 and 0.99).

We can see that all of these variables are (almost) normally distributed. There are some outliers, but they do not change the overall shape of the distributions.
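The plotting code for these charts is not included here. One way such a histogram with reference lines could be drawn with ggplot2 is sketched below. The sample holds only the first ten alcohol values shown by `str(df)`, and using the 0.25 and 0.75 quantiles for the dashed black lines is an assumption.

```r
library(ggplot2)

# Illustrative sample: the first ten alcohol values printed by str(df).
wine_sample <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)
x <- wine_sample$alcohol

p <- ggplot(wine_sample, aes(x = alcohol)) +
  geom_histogram(bins = 10, fill = "grey80", colour = "grey40") +
  # Blue line: median. Red line: mean.
  geom_vline(xintercept = median(x), colour = "blue") +
  geom_vline(xintercept = mean(x), colour = "red") +
  # Dashed black lines: quartiles (assumed to be 0.25 and 0.75).
  geom_vline(xintercept = unname(quantile(x, c(0.25, 0.75))),
             linetype = "dashed", colour = "black") +
  # Dashed orange lines: 1st and 99th percentiles.
  geom_vline(xintercept = unname(quantile(x, c(0.01, 0.99))),
             linetype = "dashed", colour = "orange")
p
```

Replacing `wine_sample` with the full `df` and `alcohol` with any other numeric column gives the same kind of chart for that variable.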
Since quality is a categorical variable, let’s group it into three categories (A is the best and C is the worst).
# Create a new variable (quality.catagories): "A" for scores 8-10,
# "B" for 5-7, "C" for 0-4
df$quality.catagories <- ifelse(df$quality >= 8, "A",
                         ifelse(df$quality >= 5, "B",
                         ifelse(df$quality >= 0, "C",
                                "Other")))
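As an alternative sketch, the same three bins can be expressed with base R's `cut()`, which returns a factor directly. The vector `quality` below is a stand-in covering every possible score from 0 to 10, not the real column.

```r
# Stand-in scores covering the whole 0-10 scale.
quality <- 0:10

# Nested ifelse(), as used above.
by_ifelse <- ifelse(quality >= 8, "A",
             ifelse(quality >= 5, "B",
             ifelse(quality >= 0, "C",
                    "Other")))

# cut() with right-closed intervals: (-Inf, 4] -> C, (4, 7] -> B, (7, Inf) -> A.
by_cut <- cut(quality,
              breaks = c(-Inf, 4, 7, Inf),
              labels = c("C", "B", "A"))

# Both approaches agree for every score on the 0-10 scale.
stopifnot(identical(by_ifelse, as.character(by_cut)))
table(by_cut)
```

The two versions differ only for negative scores, which `ifelse()` labels "Other" and `cut()` labels "C"; no such scores exist on a 0 to 10 scale.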
After that, we will use these categories in plots of the physicochemical variables to find out what characterizes the best white wine.
library(ggplot2)
# Bar chart of the number of wines in each quality category
ggplot(df, aes(x = quality.catagories)) +
  geom_bar()
The chart above shows the number of wines in each quality.catagories level.
It appears that most of the features for the best wines are normally distributed, except residual.sugar and alcohol. There are also some high (positive) outliers in almost every one of these features.
As we can see, the first plot has a long tail and most of the values are under 20. After applying a log10 scale in the second plot, the shape of the distribution is much easier to see.
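The two plots could be produced along the following lines. This is a sketch on the first ten residual.sugar values printed by `str(df)`, not on the full data. The log scale is safe here because the summary above reports a minimum residual.sugar of 0.6, so every value is positive.

```r
library(ggplot2)

# Illustrative sample: the first ten residual.sugar values printed by str(df).
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# First plot: histogram on the original scale.
p_raw <- ggplot(wine_sample, aes(x = residual.sugar)) +
  geom_histogram(bins = 10)

# Second plot: the same histogram with a log10-transformed x-axis.
p_log <- p_raw + scale_x_log10()

p_raw
p_log
```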
The dataset contains 4898 observations of 13 variables: 11 physicochemical variables, one categorical variable, and one index variable. The 11 physicochemical features are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol, and all of them are numeric. The remaining variable is quality. Although it is stored as a number, it is treated as categorical because it is a rating given by the experts.
The main feature is the quality variable, since it represents the ratings of the experts.
These are chlorides, density, and alcohol.
Yes. quality.catagories was created from quality.
residual.sugar had a long tail, with most of the values under 20. So I applied a log10 transformation to its x-axis, which made the shape of the distribution much clearer.
We can see from the plot matrix above that the feature most strongly correlated with quality is alcohol (r = 0.436). The correlation between density and residual.sugar is strongly positive (r = 0.839), and density and total.sulfur.dioxide are also positively correlated (r = 0.53). On the negative side, the correlation between density and alcohol is r = -0.78, and between density and quality it is r = -0.307.
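Coefficients like these can be computed with base R's `cor()`. The sketch below uses only the first five rows shown by `str(df)` (density is printed for five rows only), so its values will not match the full-data figures quoted above.

```r
# Illustrative sample: the first five rows as printed by str(df).
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation matrix for all pairs of columns.
r_matrix <- cor(wine_sample, method = "pearson")
round(r_matrix, 3)

# A single pair, printed in the same "r = ..." style used in this report.
r_alcohol_density <- cor(wine_sample$alcohol, wine_sample$density)
cat("r =", r_alcohol_density, "\n")
```

On the full data frame, `cor(df$alcohol, df$density)` would be the corresponding call for the density and alcohol pair.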
As we can see from the box plots in the matrix, two features really affect the quality ratings of the wine: alcohol has the strongest positive relationship with quality, and density the strongest negative one. The other features do not affect quality as much as these two. Let’s focus on them.
From the density plot we can see that when the density is high, we expect a lower rating. On the other hand, when the alcohol content is high, we expect a higher quality rating.
## r = -0.7801376
Let’s plot some of the other chemical features to see the relationships between them:
## r = 0.8389665
## r = 0.5298813
## r = -0.4506312
We saw above that the feature most strongly correlated with quality is alcohol (r = 0.436). We also saw that the correlation between density and residual.sugar is strongly positive (r = 0.839), and that density and total.sulfur.dioxide are positively correlated (r = 0.53). On the negative side, the correlation between density and alcohol is r = -0.78, and between density and quality it is r = -0.307.
Yes. There was one between alcohol and density, and another between residual.sugar and density.
It was between residual.sugar and density (r = 0.839).
Given the strong negative relationship between alcohol and density, we can see from the plot above that the best wines tend to have more alcohol, and the worst tend to have higher density.
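A plot of this kind could be sketched as a scatter plot of density against alcohol, coloured by quality category. The sample below holds only the first five rows shown by `str(df)`; they all have quality 6, so only category "B" appears, and the full data frame is needed to see the pattern described.

```r
library(ggplot2)

# Illustrative sample: the first five rows as printed by str(df).
wine_sample <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Same A/B/C binning as above, for valid scores 0-10.
wine_sample$quality.catagories <- ifelse(wine_sample$quality >= 8, "A",
                                  ifelse(wine_sample$quality >= 5, "B", "C"))

# Alcohol on the x-axis, density on the y-axis, one colour per category.
p <- ggplot(wine_sample,
            aes(x = alcohol, y = density, colour = quality.catagories)) +
  geom_point(alpha = 0.6) +
  labs(colour = "Quality category")
p
```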
Both plots include density. In the left plot we can see that better wines tend to have more residual sugar. In the right plot we can also see that wines with more alcohol tend to have more residual sugar.
We can see that the best wines tend to have more alcohol, and the worst tend to have higher density. We can also see that better wines, and wines with more alcohol, tend to have more residual sugar.
No, I didn’t see anything.
Since the correlation between density and quality is negative (r = -0.3071233), the plot above shows that wines with higher density tend to have lower quality.
We can see from the plot that the correlation between residual.sugar and density is strongly positive (r = 0.839). So the higher the residual sugar, the higher we expect the density to be.
Given the strong negative relationship between alcohol and density (r = -0.7801376), we can see from the plot above that the best wines tend to have more alcohol, and the worst tend to have higher density.
It was an interesting project and an interesting dataset to investigate. The progression of the project was fascinating, moving from univariate to bivariate to multivariate analysis. It showed which chemical properties have the strongest effect on a wine’s quality, how the properties relate to each other, and how much those relationships affect quality in the end. I think the results would be more accurate with more wines in the data, since no wine here is rated 10 out of 10. In the future, we could apply the same analysis to other datasets, such as the red wines mentioned in this dataset’s description. We could also go deeper into this dataset and explore more relationships between the chemical properties.